Objectives: Measurement of similarities between documents is typically influenced by the sparseness of the term-document matrix employed. Latent semantic indexing (LSI) may improve the results of this type of analysis. Methods: In this study, LSI was utilized in an attempt to reduce the term vector space of clinical documents and newspaper editorials. Results: After applying LSI, document similarities were revealed more clearly in clinical documents than in editorials. Clinical documents, which can be characterized by co-occurring medical terms, various expressions for the same concepts, abbreviations, and typographical errors, showed greater improvement with regard to the correlation between co-occurring terms and document similarities. Conclusions: Our results showed that LSI can be used effectively to measure similarities in clinical documents. In addition, the correlation between the co-occurrence of terms and similarities observed in this study is an important positive feature associated with LSI.
Received for review: July 20, 2010
Accepted for publication: March 5, 2011

Corresponding Author
Jinwook Choi, MD, PhD
Department of Biomedical Engineering, College of Medicine, Seoul National University, 28 Yeongeon-dong, Jongno-gu, Seoul 110-799, Korea. Tel: +82-2-2072-3421, Fax: +82-2-745-7870, E-mail: jinchoi@snu.ac.kr

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

© 2011 The Korean Society of Medical Informatics

I. Introduction

The spread of electronic medical records increases the utilization of clinical experience, knowledge, and information contained in clinical documents. Searching, collecting, and retrieving relevant information according to the user's needs is one of the important issues in current medical informatics.

Various kinds of information technologies have been applied to the medical domain to improve retrieval efficacy. Measuring similarity with the vector space model is one of the most widely used basic methods. Even though the vector space model is both efficient and effective, sparseness is a problem that degrades its retrieval performance. When composing a document, people use various expressions, and the use of synonyms or acronyms is very common.

The conventional vector space model regards each term as a separate dimension (axis). Because many different expressions are used for the same thing, a single subject is spread over many axes. Because of this variety of expressions, Landauer [1] pointed out that 99% of the cells in a term-document matrix are empty. In addition, we assume that clinical documents are written with clinical jargon, whose vocabulary is relatively small compared with that of everyday language.
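
The effect of treating every term as its own axis can be illustrated with a minimal sketch. The toy documents, the tokens, and the function name `cosine` are assumptions made for illustration only; they do not come from the study data. Two notes that describe the same complaint with different words share no axis, so their cosine similarity is zero.

```python
import math
from collections import Counter

def cosine(doc_a, doc_b):
    """Cosine similarity between two token lists under a bag-of-words model."""
    tf_a, tf_b = Counter(doc_a), Counter(doc_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical toy documents: doc1 and doc2 describe a similar complaint
# with different surface forms, so they have no term axis in common.
doc1 = ["dyspnea", "on", "exertion"]
doc2 = ["sob", "when", "walking"]
doc3 = ["dyspnea", "at", "rest"]

sim_synonyms = cosine(doc1, doc2)  # no shared term
sim_shared = cosine(doc1, doc3)    # one shared term
```

In this sketch the pair that differs only in wording scores lower than the pair that happens to reuse one literal term, which is the kind of behavior a reduced latent space is meant to soften.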

In this study, we review a method for measuring the similarity among documents. To examine the nature of similarity among various documents, we applied latent semantic indexing (LSI) to newspaper editorials and clinical documents. Finally, we analyzed the major variables affecting the similarity.
Deerwester et al. [2] proposed LSI using singular value decomposition (SVD) and compared its performance with that of the SMART system. By analyzing previous latent semantic analysis (LSA) and information retrieval research, both her own and that of other scholars, Dumais [3] identified the critical factors affecting the performance of LSA: the number of dimensions to reduce to, the diversity and size of the collection, and the number of singular values extracted (Figure 1).

Hofmann suggested probabilistic LSI (pLSI), which accepted the basic idea of LSI but applied a probabilistic model and the expectation maximization algorithm instead of SVD [4,5]. Building on LSI and pLSI, Blei et al. [6] developed latent Dirichlet allocation (LDA) as a modified probabilistic topic model. Many variations have been studied since then [7-9].

LSI is used in three ways. One approach regards LSI as a soft clustering method [10]. Another emphasizes its role in dimension reduction. The third focuses on LSI as a solution to the sparseness problem: instead of a high-dimensional space whose axes are individual terms, terms and documents are located in a relatively low-dimensional space created by LSI.
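
The low-dimensional space mentioned above can be written compactly in standard SVD notation. The symbols here are assumptions introduced for illustration and are not taken from the paper: X is a term-document matrix with t terms and d documents, and k is the number of retained dimensions.

```latex
% Full singular value decomposition of the t x d term-document matrix X
X = U \Sigma V^{\top}

% Rank-k approximation: keep only the k largest singular values
X_k = U_k \Sigma_k V_k^{\top}, \qquad k \ll \min(t, d)

% Document j is represented by column j of \Sigma_k V_k^{\top};
% similarity between documents i and j is the cosine of those columns
\hat{d}_j = \Sigma_k V_k^{\top} e_j, \qquad
\operatorname{sim}(i, j) = \frac{\hat{d}_i^{\top} \hat{d}_j}{\lVert \hat{d}_i \rVert \, \lVert \hat{d}_j \rVert}
```

Here e_j denotes the j-th standard basis vector, so the product simply selects the j-th column.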

II. Methods 

1. Data Collection

A total of 15,618 de-identified discharge summaries from Seoul National University Hospital for the year 2003 were collected. They include information such as the patient's gender, age, admission date, discharge date, ward, department, doctors in charge, chief complaint, onset of problems, patient status on admission, diagnosis, problem list, test results, discharge type, prescriptions, appointments, history, results of physical examination, progress, post-discharge care instructions, operation record, and future plan. Some of these columns are filled with free-text descriptions, and others contain predefined coded data.
For each trial, we randomly selected 1,000 documents from the collection. In one trial, we collected data from all available columns; in another, we collected data only from columns consisting of free-text descriptions. All discharge summaries used in this study contain a mixture of Korean and English words.

As the second document collection, we gathered newspaper articles, which we assumed to contain a wide variety of expressions and terms. We collected 1,000 editorials from three Korean daily newspapers published in either 2008 or 2009. All documents were written in Korean; some include numbers, alphabetic characters or words, or traditional Chinese characters.

We classified the document sets into three groups: ED, CF, and CS. ED denotes the collection of newspaper editorials, CF denotes clinical documents with full columns, and CS denotes clinical documents with selected columns. While CF contains all sections of the discharge summary, CS has only a limited set of sections: chief complaint, patient status on admission, diagnosis, problem list, test results, discharge type, history, results of physical examination, progress, post-discharge care instructions, operation record, and future plan.

2. Preprocessing 

Korean is classified as an agglutinative language: most words are formed by joining morphemes together [11]. Morphemes in Korean are usually divided into lexical morphemes and grammatical morphemes. The former carry meaning, whereas the latter only play a grammatical role in a sentence. While prepositions in English are written as separate words, grammatical morphemes in Korean are attached to lexical morphemes within the same token.

In this study, only lexical morphemes are regarded as meaningful terms, and grammatical morphemes are discarded, just as stop words are ignored in many information retrieval tasks. In bag-of-words information retrieval, lexical morphemes in Korean play a role similar to that of word stems in English.




Figure 1. Diagram showing a conceptual description of singular value decomposition.
Table 1. Number of terms and proportions of zero-filled cells

      No. of unique terms   No. of documents   Zero-filled cells (%)
ED    17,077                1,000              99.0
CF    14,610                1,000              98.9
CS    13,809                1,000              99.0

ED: editorials, CF: clinical documents with full columns, CS: clinical documents with selected columns.




Table 2. Cosine values calculated with term frequency (TF) and inverse document frequency (IDF) increased after latent semantic indexing (LSI)

                                          Mean     SD
Cosine value before LSI (with TF-IDF)
  ED                                      0.026    0.035
  CF                                      0.034    0.031
  CS                                      0.032    0.030
Cosine value after LSI
  ED                                      0.195    0.121
  CF                                      0.353    0.167
  CS                                      0.256    0.128
Increase of cosine value after LSI
  ED                                      0.169    0.099
  CF                                      0.319    0.155
  CS                                      0.224    0.110

SD: standard deviation, ED: editorials, CF: clinical documents with full columns, CS: clinical documents with selected columns.



In most cases, clinical documents contain various kinds of abbreviations. Most acronyms in the clinical documents were kept unchanged, but very common one-letter terms were replaced with their longer forms. For example, "A/N/V/D/C (+/-/-/+/-)" was replaced with "Anorexia/Nausea/Vomiting/Diarrhea/Constipation (+/-/-/+/-)". Typos in the clinical documents were not corrected for this study.
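
The one-letter expansion described above might look like the following sketch. The lookup table holds only the five entries named in the text; the regular expression, the function name `expand_single_letters`, and the rule of leaving a run untouched when any letter is unknown are assumptions, not the procedure actually used in the study.

```python
import re

# Hypothetical lookup table; only these five entries are named in the text.
EXPANSIONS = {
    "A": "Anorexia",
    "N": "Nausea",
    "V": "Vomiting",
    "D": "Diarrhea",
    "C": "Constipation",
}

# Matches slash-separated runs of single capital letters such as "A/N/V/D/C".
PATTERN = re.compile(r"\b[A-Z](?:/[A-Z])+\b")

def expand_single_letters(text):
    """Replace runs of one-letter abbreviations with their longer forms."""
    def replace(match):
        letters = match.group(0).split("/")
        # Leave the run untouched unless every letter is in the table.
        if not all(letter in EXPANSIONS for letter in letters):
            return match.group(0)
        return "/".join(EXPANSIONS[letter] for letter in letters)
    return PATTERN.sub(replace, text)

expanded = expand_single_letters("A/N/V/D/C (+/-/-/+/-)")
```

Multi-letter acronyms such as "BP/HR" do not match the pattern and pass through unchanged, which mirrors the statement that most acronyms were accepted as they were.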

3. Singular Value Decomposition 

First, term-document matrices were built. The cells of the matrices were filled with the term frequencies of each lexical morpheme. MATLAB 7.01 (MathWorks, Natick, MA, USA) was used for the singular value decomposition of the term-document matrices. All three collections were reduced to 100 dimensions. In this study, we chose 100 as the dimension size, considering both Deerwester et al. [2]'s research
records. R 2.10.0 was used for the statistical analysis, and Microsoft SQL Server 2008 (Microsoft Corporation, Redmond, WA, USA) was used for the flexible management of data.

Table 1 shows the general composition of the collections. Landauer's assertion that about 99% of the cells of a term-document matrix are filled with 0 is confirmed in these collections.
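
The zero-filled proportion reported in Table 1 is a simple ratio that can be sketched as follows. The tokenized toy documents and the function names are hypothetical; the actual matrices in the study were built from lexical morphemes and are far larger.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (sorted vocabulary, matrix) with one row per term and one column per document."""
    counts = [Counter(doc) for doc in docs]
    vocab = sorted(set().union(*docs))
    # Counter returns 0 for absent terms, which produces the empty cells.
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix

def zero_cell_proportion(matrix):
    """Fraction of cells in the matrix whose term frequency is zero."""
    cells = [value for row in matrix for value in row]
    return sum(1 for value in cells if value == 0) / len(cells)

# Hypothetical toy collection of three tokenized documents.
docs = [
    ["fever", "cough", "fever"],
    ["cough", "dyspnea"],
    ["fracture", "pain"],
]
vocab, matrix = term_document_matrix(docs)
sparsity = zero_cell_proportion(matrix)
```

Even in this tiny collection most cells are empty; with thousands of unique terms and 1,000 documents, the proportion approaches the 99% level shown in Table 1.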

4. Co-occurrence 

We set three operational definitions for the co-occurrence of terms between documents in order to measure their influence on document similarity. The first is the number of shared terms, that is, how many unique terms are shared between two documents. The second is the averaged shared term frequency: the frequencies of the shared terms are summed and divided by the number of unique terms shared by both documents. The third is the averaged unshared term frequency between documents.

With these variables, this study empirically examines the relation between the co-occurrence of terms and document similarity using Pearson's correlation.
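
One possible reading of the three operational variables is sketched below. The text does not state how the frequencies from the two documents are combined, so two assumptions are made and marked in the comments: a shared term's frequency is averaged over the two documents, and an unshared term contributes its frequency in the one document that contains it. The function name `cooccurrence_stats` and the toy documents are hypothetical.

```python
from collections import Counter

def cooccurrence_stats(doc_a, doc_b):
    """Return (number of shared terms, averaged shared TF, averaged unshared TF)."""
    tf_a, tf_b = Counter(doc_a), Counter(doc_b)
    shared = tf_a.keys() & tf_b.keys()    # unique terms in both documents
    unshared = tf_a.keys() ^ tf_b.keys()  # unique terms in exactly one document
    n_shared = len(shared)
    # Assumption: a shared term's frequency is the mean of its two frequencies.
    shared_tf = sum((tf_a[t] + tf_b[t]) / 2 for t in shared)
    avg_shared_tf = shared_tf / n_shared if n_shared else 0.0
    # Assumption: an unshared term counts with its frequency in the one
    # document that contains it (the other Counter lookup returns 0).
    unshared_tf = sum(tf_a[t] + tf_b[t] for t in unshared)
    avg_unshared_tf = unshared_tf / len(unshared) if unshared else 0.0
    return n_shared, avg_shared_tf, avg_unshared_tf

# Hypothetical toy pair of tokenized documents.
doc_a = ["fever", "fever", "cough", "chill"]
doc_b = ["fever", "cough", "cough", "cough", "rash"]
stats = cooccurrence_stats(doc_a, doc_b)
```

Under these assumptions a pair with no shared term yields zero for the first two variables, which would be consistent with the minimum values of 0.00 listed for ED and CS in Table 3.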

III. Results 

1. Document Similarity Measurement 

Similarity between two documents was measured using the vector space model. It was calculated before and after LSI as described in [2]. All documents in a collection were combined into pairs, and the similarity of each pair was calculated with this method. From each collection, the similarities of 499,500 document pairs (all unordered pairs of 1,000 documents) were measured.

Table 2 shows that the increase in cosine values between documents after LSI is more pronounced in clinical documents than in editorials. The average similarity between clinical documents with full columns (CF) is the highest among the three collections.
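
The pairwise measurement can be sketched with NumPy, used here in place of the MATLAB routines named in the Methods. Several things are assumptions for illustration: the random toy matrix, the choice of k = 5, the function names, and the use of raw counts rather than the TF-IDF weighting applied in the study before LSI.

```python
import numpy as np

def pairwise_cosine(doc_vectors):
    """Cosine similarity for every unordered pair of rows (one row per document)."""
    norms = np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    unit = doc_vectors / np.where(norms == 0, 1.0, norms)
    sims = unit @ unit.T
    upper = np.triu_indices(len(doc_vectors), k=1)  # each pair once, no self-pairs
    return sims[upper]

def lsi_document_vectors(term_doc, k):
    """Project documents onto the k leading singular dimensions of a term-document matrix."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    return (np.diag(s[:k]) @ vt[:k]).T  # one row per document

# Hypothetical toy data: 200 terms by 30 documents of sparse counts.
rng = np.random.default_rng(0)
term_doc = rng.poisson(0.2, size=(200, 30)).astype(float)

before = pairwise_cosine(term_doc.T)                          # term space
after = pairwise_cosine(lsi_document_vectors(term_doc, k=5))  # reduced space

n_pairs = 30 * 29 // 2            # pairs in the toy collection
pairs_for_1000 = 1000 * 999 // 2  # 499,500, the count reported in the text
```

Keeping every singular dimension reproduces the original cosine values, so any change in similarity comes from the truncation to k dimensions.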

2. The Characteristics of Co-occurrence among Different 
Collections 

Table 3 shows the results of measuring term co-occurrence in the collections. The number of terms shared by two documents differed considerably: clinical documents share more terms than editorials. The average number of shared terms is 29.37 for clinical documents with full columns, 20.52 for clinical documents with selected columns, and 16.57 for editorials.

Figure 2 shows the distribution of co-occurring terms among the collections. As shown in Table 3, the average number of co-occurring terms is higher in clinical documents than in editorials. This indicates that clinical documents are written with a more confined set of jargon than editorials. This reliance on domain-specific jargon can be exploited in future studies on clustering or topic extraction from document collections.

Table 3. Number of co-occurring terms in editorials (ED), clinical documents with full columns (CF), and clinical documents with selected columns (CS)

             No. of co-occurring terms   Average shared TF   Average unshared TF
ED   Mean    16.57                       1.18                1.34
     SD      6.27                        0.15                0.08
     Min.    0.00                        0.00                1.05
     Max.    180.00                      3.07                1.77
CF   Mean    29.37                       1.33                1.37
     SD      16.99                       0.27                0.12
     Min.    2.00                        1.02                1.00
     Max.    166.00                      4.00                2.17
CS   Mean    20.52                       1.14                1.35
     SD      16.51                       0.18                0.13
     Min.    0.00                        0.00                1.00
     Max.    180.00                      5.00                2.77

TF: term frequency, SD: standard deviation, Min: minimum, Max: maximum.

Figure 2. Distributions of unique number of shared terms in editorials and clinical (Cli.) documents. Legend: Editorial, Cli. full column, Cli. selected column. Horizontal axis: number of co-occurring terms between documents (0 to 150).

3. Evaluation of Co-occurring Term Influence 

To check the correlation between co-occurrence and document similarity, we measured Pearson's correlation coefficients between each operational variable defined above and the cosine similarities. According to the Pearson's correlation analysis, the relation between each variable and cosine similarity differs by collection (Table 4).

Table 4. Pearson's correlation between number of co-occurring terms and document similarity

                           Pearson's correlation   p-value
ED   Unique term            0.620                  <0.001
     Avg. shared TF         0.559                  <0.001
     Avg. unshared TF      -0.514                  <0.001
CF   Unique term            0.443                  <0.001
     Avg. shared TF         0.662                  <0.001
     Avg. unshared TF      -0.417                  <0.001
CS   Unique term            0.624                  <0.001
     Avg. shared TF         0.364                  <0.001
     Avg. unshared TF      -0.197                  <0.001

TF: term frequency, ED: editorials, CF: clinical documents with full columns, CS: clinical documents with selected columns.

In ED, the number of co-occurring terms correlates fairly well with document similarity. In contrast, in CF the correlation between term co-occurrence and similarity is low. In CS, however, the correlation is about as high as in ED. The effect of co-occurrence on document similarity can therefore be regarded as similar in the editorial collection and the clinical collection with selected columns. The reason for the low correlation in the clinical collection with full columns is examined in the Discussion section.
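
For reference, the coefficient used in this analysis can be computed as in the sketch below. The study used R for its statistics; this plain-Python function and the five toy observations are assumptions for illustration and do not reproduce any value in Table 4.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sums of cross-deviations and squared deviations (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical observations for five document pairs: number of shared
# unique terms and the cosine similarity of the same pair.
shared_terms = [5, 12, 20, 31, 44]
cosines = [0.10, 0.18, 0.22, 0.35, 0.41]
r = pearson_r(shared_terms, cosines)
```

In the study each collection contributes 499,500 such pairs per variable, so one coefficient summarizes the whole set of pairs for that collection.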

IV. Discussion 

In this study, we evaluated the importance of co-occurrence for document similarity. To check whether the co-occurrence of terms plays a key role in document similarity, we performed experiments using three types of collections. Through the experiments, we found that the co-occurrence of terms explains a large portion of the similarity among documents.

The results reveal domain-specific document characteristics. Similarities between documents were generally higher in clinical documents than in editorials, which can be interpreted as meaning that domain-specific jargon is used more heavily in the clinical field than in newspapers.

The low correlation between co-occurrence and similarity in CF is meaningful. The CF collection has all columns (subsections) of the discharge summary, whereas CS has only selected subsections. Subsections such as department, ward, and admission date can contribute many terms that co-occur across all clinical documents. Those non-specific, generally common terms can act as noise in the collection. In contrast, CS does not have those sections, which leads to a better result in the correlation test.

The dimension size of 100 was set after a series of dimension-setting experiments. However, there is no general rule for determining the proper dimension size. Although Dumais [3] did not propose a general criterion for a reasonable dimension size for LSI, in her experience the dimension size should be large enough to yield fairly good performance. In her research, she found that information retrieval performance on a medical collection peaked at a dimension size of 90.

LSI is theoretically based on the co-occurrence of terms between documents. However, co-occurrence alone cannot explain the effectiveness of LSI in most cases. According to Landauer's study [1], the correlation between LSA-measured word-pair similarities and the number of times the words appeared in the same passage was only a little higher than the correlation with the number of times they appeared separately in different passages. He asserted that the number or proportion of literal words shared between two passages is not the determinant of their similarity in LSA [1].

Similarity between documents measured after LSI was higher than that measured with the term frequency-inverse document frequency matrix. Both the average cosine value and its increase were higher in clinical documents. Mathematically, LSI seems to be effective in amplifying the similarity between documents. However, the similarity was not evaluated by comparison with a set of relevant documents. To check whether LSI is really a better method for measuring similarity between clinical documents, a further study with a gold standard is needed.

The need to find the best method for clustering clinical documents motivated this research. Because we expected more co-occurring jargon in the medical field than in other fields, we performed an experiment on the effect of term co-occurrence on document similarity. Through the experiment, we found that high frequencies of term co-occurrence in clinical documents were highly associated with similarity among documents. The effect was more significant in the clinical collections than in the editorial collection. However, because clinical collections contain a huge number of non-specific co-occurring terms that can be regarded as noise, we suggest that preprocessing be considered seriously in further, more refined studies.